Quick Introduction to Pandas

About Me:
Aditya Laghate aka @ThinRhino
A pseudo geek - Currently learning Math, Statistics & Data Analytics


In [1]:
# imports
from pandas import Series, DataFrame
import pandas as pd
pandas provides rich data structures and functions designed to make working with structured data fast, easy, and expressive. The primary object in pandas that will be used in this book is the DataFrame, a two- dimensional tabular, column-oriented data structure with both row and column labels. pandas combines the high performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data.

Pandas: Basic datastructes: Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.


In [2]:
# Series
obj = Series([-4, 3, 6, 6])
obj


Out[2]:
0   -4
1    3
2    6
3    6
dtype: int64

In [3]:
obj.values


Out[3]:
array([-4,  3,  6,  6])

In [4]:
obj.index


Out[4]:
Int64Index([0, 1, 2, 3], dtype=int64)

In [5]:
obj2 = Series([2, 5, -5, 3], index=['a', 'b', 'c', 'd'])
obj2


Out[5]:
a    2
b    5
c   -5
d    3
dtype: int64

In [6]:
obj2.values, obj2.index


Out[6]:
(array([ 2,  5, -5,  3]), Index([u'a', u'b', u'c', u'd'], dtype=object))

In [8]:
# NumPy array operations
obj2['a']


Out[8]:
2

In [9]:
obj2 * 2


Out[9]:
a     4
b    10
c   -10
d     6
dtype: int64

In [10]:
# Converting Python dict to a series

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000,}
obj3 = Series(sdata)
obj3


Out[10]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [11]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4


Out[11]:
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64

In [13]:
obj4.dropna()


Out[13]:
Ohio      35000
Oregon    16000
Texas     71000
dtype: float64

In [14]:
obj3 + obj4


Out[14]:
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an or- dered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series (one for all sharing the same index).


In [15]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame


Out[15]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002

In [16]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four', 'five'])
frame2


Out[16]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN

In [17]:
frame2['debt'] = 16.5
frame2


Out[17]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5

In [18]:
frame2['debt'] = np.arange(5.)
frame2


Out[18]:
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4

In [19]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2


Out[19]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7

In [20]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2


Out[20]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False

In [21]:
population = {'Nevada': {2001: 2.4, 2002: 2.9},
              'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(population)
frame3


Out[21]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6

In [22]:
frame3.T


Out[22]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6

Examples + Functionality


In [23]:
# input data
# Cannot share data!
df = pd.read_excel('/Users/aditya/PLUG/pandas/chap_04.xls', 'Sheet1')
df


Out[23]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16416 entries, 0 to 16415
Data columns (total 9 columns):
id           16416  non-null values
to_port      16416  non-null values
from_port    16416  non-null values
hs_code      16416  non-null values
descp        16416  non-null values
qty          16416  non-null values
unit         16416  non-null values
value        16416  non-null values
date         16416  non-null values
dtypes: float64(4), object(5)

In [24]:
df.head()


Out[24]:
id to_port from_port hs_code descp qty unit value date
0 13992 DUBAI INAMD4 4059090 MILK MAWA 500 KGS 88698.75 2013-02-01
1 30310 DUBAI INAMD4 4059090 MILK MAWA 700 KGS 119600.25 2012-11-02
2 37135 BANGKOK INAMD4 4039090 CREAM-35 2 NOS 42504.00 2012-10-03
3 47681 MAPUTO INAMD4 4089900 EMU EGGS 18 PCS 21231.30 2013-01-05
4 53994 CAPE TOWN INAMD4 4051000 PEANUT BUTTER SMOOTH HAVING NO. COMM. VALUE VA... 13 PAC 8898.89 2012-10-05

In [25]:
# Change index and convert string to datetime
df.index = pd.to_datetime(df.pop('date'), format='%Y-%m-%d')

In [26]:
df


Out[26]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16416 entries, 2013-02-01 00:00:00 to 2012-01-08 00:00:00
Data columns (total 8 columns):
id           16416  non-null values
to_port      16416  non-null values
from_port    16416  non-null values
hs_code      16416  non-null values
descp        16416  non-null values
qty          16416  non-null values
unit         16416  non-null values
value        16416  non-null values
dtypes: float64(4), object(4)

In [27]:
df.head()


Out[27]:
id to_port from_port hs_code descp qty unit value
date
2013-02-01 13992 DUBAI INAMD4 4059090 MILK MAWA 500 KGS 88698.75
2012-11-02 30310 DUBAI INAMD4 4059090 MILK MAWA 700 KGS 119600.25
2012-10-03 37135 BANGKOK INAMD4 4039090 CREAM-35 2 NOS 42504.00
2013-01-05 47681 MAPUTO INAMD4 4089900 EMU EGGS 18 PCS 21231.30
2012-10-05 53994 CAPE TOWN INAMD4 4051000 PEANUT BUTTER SMOOTH HAVING NO. COMM. VALUE VA... 13 PAC 8898.89

In [28]:
# Monthly mean
mth_mean = df.value.resample('M', how=['mean', 'median', 'std'])
mth_mean.head()


Out[28]:
mean median std
date
2012-01-31 959229.921205 386905.455 1285043.331985
2012-02-29 1058314.492339 368042.130 1478173.074670
2012-03-31 1176705.192077 436196.825 1573295.375182
2012-04-30 1493411.445170 754509.750 1883248.945086
2012-05-31 1064001.441176 307954.995 1581324.286834

In [29]:
mth_mean.plot()


Out[29]:
<matplotlib.axes.AxesSubplot at 0x108d9e690>

In [30]:
hs_code = df.groupby([df.hs_code, df.index])['value'].mean()
hs_code


Out[30]:
hs_code  date      
4011000  2012-02-03      2780.780000
         2012-03-08    246420.000000
         2012-03-15    352836.000000
         2012-06-11    187621.000000
         2012-07-11     38397.333333
         2012-07-19    117491.500000
         2012-07-26    131503.133333
         2012-08-04    159472.060000
         2012-08-16    375526.000000
         2012-09-01      4825.960000
         2012-09-07    105120.936667
         2012-09-12    137767.000000
         2012-10-12     67138.000000
         2012-10-16    143940.800000
         2012-11-02    280596.600000
...
4090000  2013-09-21    2063096.503333
         2013-09-22      25512.810000
         2013-09-23    3739585.000000
         2013-09-24    1851325.433333
4100020  2013-01-27    3195802.400000
4100090  2012-07-27      79242.500000
         2012-10-16      28404.975000
         2012-12-29      26002.660000
         2013-02-14    1284040.730000
         2013-02-20     201984.230000
         2013-03-21       9887.370000
         2013-06-18        770.740000
         2013-07-23       1788.830000
         2013-08-26     429075.250000
         2013-09-19       7419.510000
Name: value, Length: 4861, dtype: float64

In [31]:
df.from_port.unique()


Out[31]:
array([u'INAMD4', u'INASR6', u'INBFR6', u'INBLJ6', u'INBLR4', u'INBOM4',
       u'INBRC6', u'INCCJ4', u'INCCU1', u'INCCU4', u'INCOK1', u'INCOK4',
       u'INCPL6', u'INDEL4', u'INDER6', u'INDIG6', u'INGHR6', u'INHYD4',
       u'INIGU6', u'ININD6', u'INIXY1', u'INJAI4', u'INJNR6', u'INJUC6',
       u'INKKU6', u'INKNU6', u'INLDH6', u'INLON6', u'INMAA1', u'INMAA4',
       u'INMUL6', u'INMUN1', u'INNGP6', u'INNML1', u'INNSA1', u'INPAV1',
       u'INPNK6', u'INPPG6', u'INPTL6', u'INPTPB', u'INRXLB', u'INSAU6',
       u'INSBI6', u'INSLL6', u'INSNF6', u'INSTT6', u'INTDE6', u'INTKD6',
       u'INTLG6', u'INTRV4', u'INTUT1', u'INTUT6', u'INTVT6', u'INVTZ1',
       u'INWFD6', u'INCJB4'], dtype=object)

In [33]:
grp_from_port = df['value'].groupby(df.from_port).mean()
grp_from_port


Out[33]:
from_port
INAMD4          50343.243529
INASR6          80679.493333
INBFR6       17108924.530500
INBLJ6       21627388.000000
INBLR4         936275.835472
INBOM4         456250.207978
INBRC6           3216.150000
INCCJ4        1410133.738500
INCCU1        1752904.833333
INCCU4          79375.828169
INCJB4            216.140000
INCOK1        4153438.319750
INCOK4        1318740.454655
INCPL6        3257838.265639
INDEL4         127494.707753
INDER6        4367968.702348
INDIG6        3699249.951371
INGHR6        3106688.736522
INHYD4         197461.645135
INIGU6         556976.160000
ININD6         175183.656667
INIXY1          82307.460000
INJAI4           6162.625000
INJNR6         272113.560000
INJUC6          80625.532222
INKKU6         103572.702381
INKNU6        1585317.140000
INLDH6        2001159.927033
INLON6        4145742.874536
INMAA1        1937629.396295
INMAA4         168402.481327
INMUL6        5142684.375000
INMUN1         241851.347842
INNGP6         158273.442644
INNML1          53109.305000
INNSA1        2634457.914446
INPAV1           1659.690000
INPNK6        1800815.610000
INPPG6         126199.658000
INPTL6        3601896.675000
INPTPB         646107.514316
INRXLB         324754.218339
INSAU6        8978054.000000
INSBI6        3996768.823058
INSLL6         932813.600294
INSNF6        2284636.987692
INSTT6        1981081.200000
INTDE6         334989.260000
INTKD6        3853034.473504
INTLG6         851770.669545
INTRV4          38729.078286
INTUT1        1660852.233313
INTUT6         268119.810000
INTVT6        2538729.798889
INVTZ1         403831.782000
INWFD6        2736054.635055
Length: 56, dtype: float64

In [34]:
grp_from_port[:10].plot(kind='barh',color='k', alpha=0.7)


Out[34]:
<matplotlib.axes.AxesSubplot at 0x109123b10>

In [35]:
from_port = df.groupby(df.from_port)['value'].mean()
from_port[:10].plot(kind='bar')


Out[35]:
<matplotlib.axes.AxesSubplot at 0x108d90e10>

In [36]:
new_df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'key2': ['one', 'two', 'one', 'two', 'one'],
                    'data1': np.random.randn(5),
                    'data2': np.random.randn(5)})
new_df


Out[36]:
data1 data2 key1 key2
0 -0.177947 1.585618 a one
1 -1.883930 -0.746285 a two
2 -1.262905 -0.072472 b one
3 1.163769 -0.125257 b two
4 0.415085 0.787418 a one

In [37]:
grouped = new_df['data1'].groupby(new_df['key1'])
grouped


Out[37]:
<pandas.core.groupby.SeriesGroupBy object at 0x109136cd0>

In [38]:
grouped.mean()


Out[38]:
key1
a      -0.548931
b      -0.049568
dtype: float64

Other Datasets


In [39]:
# https://github.com/pydata/pydata-book/blob/master/ch08/tips.csv

tips = pd.read_csv('/Users/aditya/Repo/pydata-book/ch08/tips.csv')
# tips.describe()
tips


Out[39]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

In [45]:
tips.tail(10)


Out[45]:
total_bill tip sex smoker day time size
234 15.53 3.00 Male Yes Sat Dinner 2
235 10.07 1.25 Male No Sat Dinner 2
236 12.60 1.00 Male Yes Sat Dinner 2
237 32.83 1.17 Male Yes Sat Dinner 2
238 35.83 4.67 Female No Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

In [46]:
# Add tips percentage of total bill
tips['tip_pct'] = tips.tip / tips.total_bill
tips.head()


Out[46]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808

In [47]:
grouped = tips.groupby([tips.sex, tips.smoker])
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')


Out[47]:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64

In [48]:
functions = ['count', 'mean', 'max']
result = grouped['tip_pct', 'total_bill'].agg(functions)
result


Out[48]:
tip_pct total_bill
count mean max count mean max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81

In [49]:
def top(df, n=5, column='tip_pct'):
    return df.sort_index(by=column)[-n:]

top(tips, n=10)


Out[49]:
total_bill tip sex smoker day time size tip_pct
51 10.29 2.60 Female No Sun Dinner 2 0.252672
221 13.42 3.48 Female Yes Fri Lunch 2 0.259314
93 16.32 4.30 Female Yes Fri Dinner 2 0.263480
149 7.51 2.00 Male No Thur Lunch 2 0.266312
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345

In [50]:
tips.groupby('smoker').apply(top)


Out[50]:
total_bill tip sex smoker day time size tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345

In [51]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')


Out[51]:
total_bill tip sex smoker day time size tip_pct
smoker day
No Fri 94 22.75 3.25 Female No Fri Dinner 2 0.142857
Sat 212 48.33 9.00 Male No Sat Dinner 4 0.186220
Sun 156 48.17 5.00 Male No Sun Dinner 6 0.103799
Thur 142 41.19 5.00 Male No Thur Lunch 5 0.121389
Yes Fri 95 40.17 4.73 Male Yes Fri Dinner 4 0.117750
Sat 170 50.81 10.00 Male Yes Sat Dinner 3 0.196812
Sun 182 45.35 3.50 Male Yes Sun Dinner 3 0.077178
Thur 197 43.11 5.00 Female Yes Thur Lunch 4 0.115982

More Plots


In [52]:
df = DataFrame(np.random.rand(6, 4),
               index=['one', 'two', 'three', 'four', 'five', 'six'],
               columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
df


Out[52]:
Genus A B C D
one 0.667401 0.899211 0.751784 0.726842
two 0.842748 0.609299 0.594116 0.709191
three 0.423631 0.283005 0.585332 0.454224
four 0.847370 0.204184 0.169221 0.885130
five 0.754614 0.744178 0.518544 0.995939
six 0.836338 0.652472 0.013708 0.653907

In [53]:
df.plot(kind='bar')


Out[53]:
<matplotlib.axes.AxesSubplot at 0x108ba5cd0>

In [54]:
df.plot(kind='barh', stacked=True, alpha=0.5)


Out[54]:
<matplotlib.axes.AxesSubplot at 0x108bd4810>

In [55]:
party_counts = pd.crosstab(tips.day, tips.size)
party_counts


Out[55]:
size 1 2 3 4 5 6
day
Fri 1 16 1 1 0 0
Sat 2 53 18 13 1 0
Sun 0 39 15 18 3 1
Thur 1 48 4 5 1 3

In [56]:
# Not many parties with size 1(!) and 6
# Eliminating
party_counts = party_counts.ix[:, 2:5]
party_counts


Out[56]:
size 2 3 4 5
day
Fri 16 1 1 0
Sat 53 18 13 1
Sun 39 15 18 3
Thur 48 4 5 1

In [57]:
# Normalize to sum to 1
party_pcts = party_counts.div(party_counts.sum(1).astype(float), axis=0)
party_pcts


Out[57]:
size 2 3 4 5
day
Fri 0.888889 0.055556 0.055556 0.000000
Sat 0.623529 0.211765 0.152941 0.011765
Sun 0.520000 0.200000 0.240000 0.040000
Thur 0.827586 0.068966 0.086207 0.017241

In [58]:
party_pcts.plot(kind='barh', stacked=True, alpha=0.4)


Out[58]:
<matplotlib.axes.AxesSubplot at 0x108c77390>

Missing values


In [59]:
s = Series(np.random.randn(6))
s


Out[59]:
0   -0.024049
1    1.378600
2   -0.391518
3    0.958736
4    0.153957
5   -1.135608
dtype: float64

In [60]:
s[::2] = np.nan
s


Out[60]:
0         NaN
1    1.378600
2         NaN
3    0.958736
4         NaN
5   -1.135608
dtype: float64

In [61]:
s.fillna(s.mean())


Out[61]:
0    0.400576
1    1.378600
2    0.400576
3    0.958736
4    0.400576
5   -1.135608
dtype: float64

Pivot Tables

A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.


In [62]:
tips.pivot_table(rows=['sex', 'smoker'])


Out[62]:
size tip tip_pct total_bill
sex smoker
Female No 2.592593 2.773519 0.156921 18.105185
Yes 2.242424 2.931515 0.182150 17.977879
Male No 2.711340 3.113402 0.160669 19.791237
Yes 2.500000 3.051167 0.152771 22.284500

In [63]:
# To use the aggregation function, pass it to aggfunc. For example, 'count' or len will give you a 
# cross-tabulation (count or frequency) of group sizes

tips.pivot_table('tip_pct', rows=['sex', 'smoker'], cols='day', aggfunc=len, margins=True)


Out[63]:
day Fri Sat Sun Thur All
sex smoker
Female No 2 13 14 25 54
Yes 7 15 4 7 33
Male No 2 32 43 20 97
Yes 8 27 15 10 60
All 19 87 76 62 244